Data Processing
Visualization


Michael Clark
Statistician Lead


2016-08-07

Outline

Part 1

  • Overview of Data Structures
  • Input/Output
  • Vectorization and Apply functions

Part 2

  • Pipes, and how to use them
  • plyr, dplyr, tidyr
  • data.table

Part 3

  • Visualization with ggplot2
  • Adding Interactivity

Part 1

Data Structures

Data structures

R has several core data structures:

  • Vectors
    • Factors
  • Lists
  • Matrices/arrays
  • Data frames

Vectors

Vectors form the basis of R data structures.

There are two main types, atomic vectors and lists, but I will treat lists separately.

Here is an R vector. The elements of the vector are numeric values.

x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4

Vectors

All elements of an atomic vector are the same type. Examples include:

  • characters
  • numeric (double)
  • integer
  • logical
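A quick way to see this is with typeof. A minimal sketch:

```r
# one vector per atomic type; typeof reveals the storage mode
typeof(c('a', 'b'))        # "character"
typeof(c(1.5, 2.5))        # "double"
typeof(c(1L, 2L))          # "integer" (note the L suffix)
typeof(c(TRUE, FALSE))     # "logical"

# mixing types silently coerces to the most flexible one
typeof(c(1, 'a'))          # "character"
```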

Factors

An important type of vector is the factor. Factors are used to represent categorical data.

x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q              V              what the heck?
Levels: q V what the heck?

While the underlying representation is numeric, factors are categorical, and so can’t be used as numbers would be.

as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(structure(1:3, .Label = c("q", "V", "what the heck?"))) : 'sum' not meaningful for factors

Matrices

When we move to multiple dimensions, we are dealing with arrays.

Matrices are 2-d arrays, and extremely commonly used.

In R, the vectors making up a matrix must all be of the same type.

  • e.g. all values in a matrix must be numeric.

Creating a matrix

Creating a matrix can be done in a variety of ways.

# create vectors
x = 1:4
y = 5:8
z = 9:12

rbind(x, y, z)   # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z)   # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Lists

Lists in R are highly flexible objects that can contain anything as their elements, even other lists.

  • unlike vectors, whose elements must be of the same type.

Here is an R list. We use the list function to create one.

x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[[3]][[1]]
[1] 3

[[3]][[2]]
[1] "cat"

Lists

We can use a loop to see the class of each element. Results are not auto-printed inside a loop, so we call print explicitly.

for (elem in x) print(class(elem))

Lists can be and are often named.

x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1

Data Frames

Data frames are the most commonly used data structure.

Unlike matrices, they do not have to have the same type of element.

This is because the data.frame class is actually just a list.

As such, everything about lists applies to data.frames, but they can also be indexed by row or column as well like matrices.
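Both behaviors can be seen with a small example:

```r
mydf = data.frame(a = c(1, 5, 2), b = c(3, 8, 1))

# list-style: a data.frame really is a list of equal-length columns
is.list(mydf)       # TRUE
mydf$a              # column as a vector
mydf[['b']]         # same idea

# matrix-style: index by row and column
mydf[1, 2]
mydf[mydf$a > 1, ]  # rows where a > 1
```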

Creating a data frame

mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))

We can add row names also.

rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1

Input/Output

Input/Output

Standard methods of reading in data

  • read.table
  • read.csv
  • readLines

Using the foreign package:

  • read.spss
  • read.xport

Note: the foreign package only reads older Stata files, so it is no longer useful for current ones.

Newer approaches

haven: Package to read in foreign statistical files

  • read_spss
  • read_dta

readxl: for excel files

Faster approaches

readr: Faster versions of base R functions

  • read_csv
  • read_delim

These make assumptions after an initial scan of the data.

If you don’t have ‘big’ data, this won’t help much.

However, they can also serve as a diagnostic:

  • picking up potential data entry errors.
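A sketch of the diagnostic use, assuming readr is installed; the file here is a made-up temporary csv with a letter O where a zero should be:

```r
library(readr)

# a tiny csv with a data entry error: '2O' has a letter O, not a zero
tmp = tempfile(fileext = '.csv')
writeLines(c('id,score', '1,10', '2,2O', '3,30'), tmp)

# declare the type we expect; the offending cell then shows up in problems()
d = read_csv(tmp, col_types = cols(score = col_double()))
problems(d)   # reports row, column, and the value that failed to parse
```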

Faster approaches

data.table: faster read.table

  • fread

Typically faster than readr approaches.

Other Data

Note that R has many resources for dealing with other types of data.

Some examples:

  • JSON
  • SQL
  • XML
  • YAML
  • MongoDB
  • NETCDF
  • text (e.g. a novel)
  • shapefiles
  • google spreadsheets

And many, many others.

On the horizon

feather: designed to make reading and writing data frames efficient

Works in both Python and R.

Still in early stages of development.

Indexing

Base R Indexing Refresher

Slicing vectors

letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"

Slicing matrices/data.frames

myMatrix[1, 2:3]

Base R Indexing Refresher

Label-based indexing:

mydf['row1', 'b']

Position-based indexing:

mydf[1, 2]

Base R Indexing Refresher

Mixed indexing:

mydf['row1', 2]

If the row/column value is empty, all rows/columns are retained.

mydf['row1',]
mydf[,'b']

Base R Indexing Refresher

Non-contiguous:

mydf[c(1,3),]

Boolean:

mydf[mydf$a >=2,]

Base R Indexing Refresher

List/Data.frame extraction

[ : grab a slice of elements/columns

[[ : grab specific elements/columns

$ : grab specific elements/columns

List/Data.frame extraction

my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name

Vectorization

Boolean Indexing

We can take values corresponding to some operation that results in TRUE or FALSE.

Assume x is a vector of numbers.

idx = x > 2
idx
x[idx]

Flexibility

We actually don’t have to create a Boolean object before using it.

R indexing is ridiculously flexible.

x[x > 2]
x[x != 3]
x[ifelse(x > 2, T, F)]
x[{y = idx; y}]

Vectorized operations

Consider the following loop:

for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check==TRUE){
    mydf$y[i] = 'Yes'
  } else {
    mydf$y[i] = 'No'
  }
}

Vectorized operations

Compare:

mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'

This gets us the same thing.

Vectorized operations

Boolean indexing provides an example of a vectorized operation.

The whole vector is considered rather than each element individually.

This is almost always much faster.

Vectorized operations

Log all values in a matrix.

mymatrix_log = log(mymatrix)

Would be a lot faster than looping over elements, rows or columns.

Vectorized Operations

Many vectorized functions already exist in R.

They are also often written in C, Fortran etc., and so even faster.

Apply functions

In R there a family of functions that allow for a succinct way of looping.

  • apply
  • lapply, sapply, vapply
  • tapply
  • mapply
  • replicate

Apply functions

  • apply
    • arrays, matrices, data.frames
  • lapply, sapply, vapply
    • lists, data.frames, vectors
  • tapply
    • grouped operations (table apply)
  • mapply
    • multivariate version of sapply
  • replicate
    • similar to sapply
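A minimal sketch of the most common ones:

```r
m = matrix(1:6, nrow = 2)
apply(m, 2, sum)     # column sums: 3 7 11

x = list(a = 1:3, b = 4:6)
lapply(x, mean)      # returns a list
sapply(x, mean)      # simplifies to a named vector: a = 2, b = 5

# tapply: a grouped operation
vals = c(1, 2, 3, 4)
grp  = c('a', 'a', 'b', 'b')
tapply(vals, grp, mean)   # a = 1.5, b = 3.5
```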

Example

Standardizing variables

for (i in 1:ncol(mydf)){
  x = mydf[,i]
  for (j in 1:length(x)){
    x[j] = (x[j] - mean(x))/sd(x)
  }
}

This would be a really bad way to use R.

stdize <- function(x) {
  (x-mean(x))/sd(x)
}

apply(mydf, 2, stdize)

Timings

Unit: milliseconds
       expr         min         lq       mean     median         uq        max neval
 doubleloop 3404.268719 3483.89474 3516.43856 3529.94528 3554.93860 3598.25409    25
 singleloop   33.967538   35.81762   37.70332   37.34829   38.70666   43.65861    25
       plyr  141.248976  148.02459  157.58312  150.56991  163.47453  198.68966    25
      apply   36.181748   39.19026   40.73384   40.04221   41.64133   47.46504    25
 vectorized    8.059534   10.46707   13.50199   11.19833   12.62214   53.21460    25

Apply functions

Benefits

  • Cleaner/simpler code
  • Possibly reproducible
    • more likely to use generalizable functions
  • Parallelizable

NOT faster than explicit loops.

  • single loop over columns was as fast as apply
  • Replicate and mapply are especially slow

However, they can ALWAYS potentially be made faster than loops.

  • Parallelization: parApply, parLapply etc.
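A sketch of the parallel versions, using the base parallel package (the two-worker cluster size here is an arbitrary choice):

```r
library(parallel)

cl = makeCluster(2)                 # a small local cluster
m  = matrix(rnorm(1e4), ncol = 10)

# parApply mirrors apply(m, 2, sd), but spreads columns across workers
res = parApply(cl, m, 2, sd)
stopCluster(cl)

length(res)   # 10, one sd per column
```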

Personal experience

I use R every day, and I only use a loop for a sequential operation.

  • Note: no speed difference for a for loop vs. using while
  • If you must use an explicit loop, create an empty object and fill in
    • Fastest

I never use a double loop.

Apply functions

Using apply functions should be a part of your regular R experience.

Other versions we’ll talk about have been optimized.

However, you need to know the basics in order to use those.

And you may still need parallel versions.

Part 2

Pipes

Note: more detail on much of this part is given in another workshop.

Pipes

Operators that send what comes before to what comes after.

There are many different pipes.

There are many packages that use their own.

However, the vast majority of packages use the same pipe:

%>%

Pipes

Here, we’ll focus on their use with the dplyr package.

Later, we’ll use it for visualizations.

Example.

mydf %>% 
  select(var1, var2) %>% 
  filter(var1 == 'Yes') %>% 
  summary

Start with a data.frame, select columns from it, filter/subset it, get a summary.

Using created variables as they are created

We can use variables as soon as they are created.

mydf %>% 
  mutate(newvar1 = var1 + var2,
         newvar2 = newvar1/var3) %>% 
  summarise(newvar2avg = mean(newvar2))

Pipes for Visualization (more later)

Generic example.

basegraph %>% 
  points %>%
  lines %>%
  layout

The dot

Sometimes you’ll want to use functions that aren’t aware of the piped object or the columns in it.

Example: pipe to a modeling function

mydf %>% 
  lm(y ~ x)  # error

While other pipes can do this (e.g. %$% in magrittr) one can use a dot.

  • The dot refers to the object before the pipe.
mydf %>% 
  lm(y ~ x, data=.)

Flexibility

Piping is not just for data.frames.

  • The following starts with a character vector.
  • Sends it to a recursive function (named ..).
  • .. is created on-the-fly.
  • After the function is created, it’s used on ., representing the string.
  • Result: pipes between the words.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <-  . %>%
    if (length(.) == 1)  .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
} 
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
  • Put that in your pipe and smoke it, René Magritte!

Pipes

Pipes are best used interactively.

Extremely useful for data exploration.

Common in many visualization packages.

See the magrittr package for more pipes.

plyr, dplyr, tidyr

plyr

Original data management package of the three.

More general than dplyr.

Not as useful for most common operations, but contains:

  • more flexible versions of the apply family
  • some very useful functions not found elsewhere

plyr

adply, dlply etc.

  • First letter is the current object
  • Second letter is the returned object
library(plyr)
x = list(var1=1:5, var2=2:6)
ldply(x)
   .id V1 V2 V3 V4 V5
1 var1  1  2  3  4  5
2 var2  2  3  4  5  6
ldply(x, sum)
   .id V1
1 var1 15
2 var2 20

Option to parallelize.

plyr: some useful functions

*ply: apply style functions, with parallel capability

join_all: Recursively join a list of data frames

rbind.fill: Combine data.frames by row, filling in missing columns.

mapvalues/revalue: replace values

round_any: Round to multiple of any number.
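A few of these in action, assuming plyr is installed:

```r
library(plyr)

# rbind.fill: combine data.frames that don't share all columns
d1 = data.frame(a = 1:2, b = 3:4)
d2 = data.frame(a = 5:6, c = 7:8)
rbind.fill(d1, d2)   # unmatched cells filled with NA

# mapvalues: recode specific values
mapvalues(c('lo', 'hi', 'lo'), from = c('lo', 'hi'), to = c('Low', 'High'))

# round_any: round to an arbitrary multiple
round_any(137, 25)   # 125
```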

dplyr

Grammar of data manipulation.

Next iteration of plyr.

Focused on tools for working with data frames.

  • Over 100 functions

It has three main goals:

  • Make the most important data manipulation tasks easier.

  • Do them faster.

  • Use the same interface to work with data frames, data tables, or databases.

dplyr

Some key operations:

select: grab columns

  • select helpers: one_of, starts_with, num_range etc.

filter/slice: grab rows

group_by: grouped operations

mutate/transmute: create new variables

summarize: summarise/aggregate

do: arbitrary operations
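A small sketch combining several of these verbs, using the built-in mtcars data:

```r
library(dplyr)

mtcars %>% 
  select(mpg, cyl, hp) %>%       # grab columns
  filter(hp > 100) %>%           # grab rows
  group_by(cyl) %>%              # grouped operation
  summarize(avgMPG = mean(mpg),
            n = n())             # one row per cylinder group
```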

dplyr

Various join/merge functions.

Little things like:

  • n, n_distinct, nth, n_groups, count, recode, between

No need to quote variable names.

An example

Let’s say we want to select from our data the following variables:

  • Start with the ID variable
  • The variables X1:X10, which are not all together, and there are many more X columns
  • The variables var1 and var2, which are the only var variables in the data
  • Any variable that starts with XYZ

How might we go about this?

Some base R approaches

Tedious, or typically two steps just to get the columns you want.

# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]

# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]

# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=TRUE))
newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

An alternative

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>% 
  top_n(n=50, wt=var2) %>% 
  arrange(desc(var1))

An alternative

dplyr and piping is an alternative

  • you can do all this sort of stuff with base R
  • with, within, subset, transform, etc.

Even though the initial base R approach depicted is fairly concise, it can still be:

  • noisier
  • less legible
  • less amenable to additional data changes
  • requires esoteric knowledge (e.g. regular expressions)
  • often requires new objects (even if we just want to explore)

tidyr

Two primary functions for manipulating data

  • gather: wide to long
  • spread: long to wide

Other useful functions include:

  • unite: paste together multiple columns into one
  • separate: complement of unite
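A sketch of unite and separate round-tripping a pair of columns:

```r
library(tidyr)

d = data.frame(year = c(2015, 2016), month = c(1, 2))

# unite pastes columns together into one...
d2 = unite(d, 'ym', year, month, sep = '-')
d2   # ym: '2015-1', '2016-2'

# ...and separate splits them back apart
separate(d2, ym, into = c('year', 'month'), sep = '-')
```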

Example

library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
                      X = rnorm(10, 0, 1),
                      Y = rnorm(10, 0, 2),
                      Z = rnorm(10, 0, 4) )
stocks %>% head
        time          X           Y          Z
1 2009-01-01 -1.1866420  0.61441664 -2.1011634
2 2009-01-02 -1.9764097 -3.48103599  5.8686859
3 2009-01-03  0.3170110 -1.35598840  4.1802379
4 2009-01-04 -0.5648983  1.44583225 -0.7076149
5 2009-01-05  1.0914138  3.81377475 -3.9097850
6 2009-01-06 -0.5586778  0.03054858 -3.9985824
stocks %>% gather(stock, price, -time) %>% head
        time stock      price
1 2009-01-01     X -1.1866420
2 2009-01-02     X -1.9764097
3 2009-01-03     X  0.3170110
4 2009-01-04     X -0.5648983
5 2009-01-05     X  1.0914138
6 2009-01-06     X -0.5586778

Personal Opinion

I find the dplyr grammar to be clear for a lot of standard data processing.

The best usage of it is for on-the-fly data exploration and visualization.

  • No need to create/overwrite existing objects
  • Can overwrite columns as they are created
  • Makes it easy to look at anything, and do otherwise tedious data checks

Drawbacks:

  • not as fast as data.table for many things
  • the mindset can make for unnecessary complication
    • e.g. no need to pipe etc. to create one new variable

On the horizon

multidplyr

Partitions the data across a cluster.

Faster than data.table (after partitioning)

Data.Table

data.table

data.table works in a notably different way than dplyr.

However, you’d use it for the same reasons.

Like dplyr, the data objects are both data.frames and a package specific class.

Faster subset, grouping, update, ordered joins and list columns

data.table

In general, data.table works with brackets as in base R.

However, the brackets work like a function call and have several arguments.

  • Key arguments
x[i, j, by, keyby, with = TRUE, ...]

Importantly, you can’t use the brackets as you would with data.frames.

library(data.table)
df = data.table(x=rnorm(6), g=1:3, y=runif(6))
df[,4]
[1] 4

data.table

Grab rows

df[1:3,]
            x g         y
1: -0.1915347 1 0.9621328
2:  0.5369777 2 0.8398547
3:  0.8996597 3 0.3863851

Grab columns.

df[,y]
[1] 0.9621328 0.8398547 0.3863851 0.7486282 0.6252087 0.2664072

data.table

Dropping columns is awkward.

This is because the second position in the brackets is the ‘j’ argument, an expression, not a column index.

df[,-y]             
[1] -0.9621328 -0.8398547 -0.3863851 -0.7486282 -0.6252087 -0.2664072
df[,-'y', with=F]
            x g
1: -0.1915347 1
2:  0.5369777 2
3:  0.8996597 3
4:  0.1508541 1
5: -0.8270485 2
6:  0.9119148 3

Grouped operations

group-by, with creation of a new variable.

Note that := operations modify the data.table in place (and df1 = df2 = df copies by reference, so df changes too).

df1 = df2 = df
df[,sum(x,y), by=g]
   g       V1
1: 1 1.670080
2: 2 1.174993
3: 3 2.464367
df1[,newvar := sum(x,y), by=g]
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2:  0.5369777 2 0.8398547 1.174993
3:  0.8996597 3 0.3863851 2.464367
4:  0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 1.174993
6:  0.9119148 3 0.2664072 2.464367
df1
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2:  0.5369777 2 0.8398547 1.174993
3:  0.8996597 3 0.3863851 2.464367
4:  0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 1.174993
6:  0.9119148 3 0.2664072 2.464367

We can also create groupings on the fly.

df2[,newvar := sum(x,y), by=g==1]
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2:  0.5369777 2 0.8398547 3.639359
3:  0.8996597 3 0.3863851 3.639359
4:  0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 3.639359
6:  0.9119148 3 0.2664072 3.639359
df2
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2:  0.5369777 2 0.8398547 3.639359
3:  0.8996597 3 0.3863851 3.639359
4:  0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 3.639359
6:  0.9119148 3 0.2664072 3.639359

Faster!

  • joins: fast and easy to do
df1[df2]
  • group operations: via setkey
  • reading files: fread
  • character matches: e.g. via chmatch
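A sketch of a keyed join with toy tables; setkey sorts by the key and enables the df1[df2] syntax:

```r
library(data.table)

d1 = data.table(id = 1:3, x = c(10, 20, 30))
d2 = data.table(id = 2:3, y = c('a', 'b'))

setkey(d1, id)
setkey(d2, id)

d1[d2]   # for each row of d2, the matching row of d1, plus d2's columns
```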

Timings

The following demonstrates some timings from here

  • Reproduced on my own machine
  • based on 50 million observations
  • Grouped operations are just a sum and length on a vector.

By the way, never, ever use aggregate. For anything.

          fun elapsed
1:  aggregate  114.35
2:         by   24.51
3:     sapply   11.62
4:     tapply   11.33
5:      dplyr   10.97
6:     lapply   10.65
7: data.table    2.71

Ever.

Really.

Pipe with data.table

Can be done but awkward at best.

mydf[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydf[,newvar:=mean(x), 
  ][,newvar2:=sum(newvar), by=group
  ][,-'y', with=FALSE
  ]

Probably better to just use a pipe and dot approach.

mydf[,newvar:=mean(x),] %>% 
  .[,newvar2:=sum(newvar), by=group] %>% 
  .[,-'y', with=FALSE]

My take

Faster methods are great to have.

  • Especially for group-by and joins.

Drawbacks:

  • You may frequently forget it doesn’t work like a data.frame
    • lost programming time
  • The syntax can be awkward
  • While one can pipe to and from data.tables, piping with it requires more brackets

Compromise

If speed and/or memory is (potentially) a concern, data.table

For interactive exploration, dplyr.

Piping allows one to use both, so no need to choose.
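A toy sketch of that compromise: data.table for the heavier grouped computation, dplyr verbs for the readable follow-up steps:

```r
library(data.table)
library(dplyr)

dt = data.table(g = rep(1:2, each = 3), x = 1:6)

dt[, gsum := sum(x), by = g] %>%   # fast grouped update, by reference
  filter(x > 2) %>%                # then readable dplyr steps
  arrange(desc(x))
```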

Part 3

ggplot2

ggplot2

ggplot2 is an extremely popular package for visualization in R.

  • and copied in other languages/programs

It entails a grammar of graphics.

  • Every graph is built from the same few parts

Key ideas:

  • Aesthetics
  • Layers (and geoms)
  • Piping
  • Facets
  • Themes
  • Extensions

ggplot2

Strengths:

  • Ease of getting a good looking plot
  • Easy customization
  • A lot of data processing is done for you
  • Clear syntax
  • Easy multidimensional approach
  • Equally spaced colors as a default

Aesthetics

Aesthetics allow one to map data to aesthetic aspects of the plot.

  • Size
  • Color
  • etc.

The function used in ggplot to do this is aes

aes(x=myvar, y=myvar2, color=myvar3, group=g)

Layers

In general, we start with a base layer and add to it.

In most cases you’ll start as follows.

ggplot(aes(x=myvar, y=myvar2), data=mydata)

This would not produce anything except for a plot background.

Piping

Layers are added via piping.

The first layers added are typically geoms:

  • points
  • lines
  • density
  • text

ggplot2 was using pipes before it was cool, and so it has a different pipe.

Otherwise, the concept is the same as before.

ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()

And now we would have a scatterplot.

Examples

library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()

Examples

ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()

Examples

In the following, one setting is not mapped to the data.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25) 

Stats

There are many statistical functions built in.

One of the key strengths of ggplot is that you don’t have to do much preprocessing.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_quantile()

Stats

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth()

Stats

ggplot(mtcars, aes(cyl, mpg)) + 
  geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)

Facets

Facets allow for panelled display, a very common operation.

In general, we often want comparison plots.

facet_grid will produce a grid.

  • Often this is all that’s needed

facet_wrap is more flexible.

Both use a formula approach to specify the grouping.

facet_grid

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)

facet_wrap

ggplot(mtcars, aes(wt, mpg)) + 
  geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)

Fine control

ggplot2 makes it easy to get good looking graphs quickly.

However, the amount of fine control is extensive.

ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(color=clarity), alpha=.5) + 
  scale_y_log10(breaks=c(1000,5000,10000)) +
  xlim(0, 10) +
  scale_color_brewer(type='div') +
  facet_wrap(~cut, ncol=3) +
  theme_minimal() +
  theme(axis.ticks.x=element_line(color='darkred'),
        axis.text.x=element_text(angle=-45),
        axis.text.y=element_text(size=20),
        strip.text=element_text(color='forestgreen'),
        strip.background=element_blank(),
        panel.grid.minor=element_line(color='blue'),
        legend.key=element_rect(linetype=4),
        legend.position='bottom')



Themes

In the last example you saw two uses of a theme.

  • built-in
  • specific customizations

Each argument takes on a specific value or an element function:

  • element_rect
  • element_line
  • element_text
  • element_blank

Themes

The base theme is not too good.

  • not for web
  • doesn’t look good for print either

You will almost invariably need to tweak it.

Extensions

While many contributed before, ggplot2 now has an extension system.

There is even a website to track the extensions.

Examples include:

  • additional themes
  • interactivity
  • animations
  • marginal plots
  • network graphs

Summary ggplot2

ggplot2 is an easy to use, but powerful visualization tool.

Allows one to think in many dimensions for any graph:

  • x
  • y
  • color
  • size
  • opacity
  • facet

2d graphs are not useful for conveying anything but the simplest ideas.

Use ggplot2 to easily go beyond 2d for interesting visualizations.

Packages

ggplot2 is the most widely used package for visualization in R.

However, it is not interactive by default.

Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.

Packages

General:

  • plotly
    • also used in Python, Matlab, and Julia; aside from many interactive plot types, it can convert ggplot2 figures to interactive ones.
  • ggvis
    • interactive successor to ggplot2, though not currently under active development
  • rbokeh
    • like plotly, it also has cross program support

Specific functionality:

  • DT
    • interactive data tables
  • leaflet
    • maps with OpenStreetMap
  • dygraphs
    • time series visualization
  • visNetwork
    • Network visualization

Piping for Visualization

One of the advantages to piping is that it’s not limited to dplyr style data management functions.

Any R function can be potentially piped to, and we’ve seen several examples so far.

This facilitates data exploration, especially visually.

  • don’t have to create objects
  • new variables are easily created and subsequently manipulated just for vis
  • data manipulation is not separated from visualization

htmlwidgets

Many newer visualization packages take advantage of piping as well.

htmlwidgets is a package that makes it easy to use R to create javascript visualizations.

  • i.e. what you see everywhere on the web.

The packages using it typically are pipe-oriented and produce interactive plots.

plotly example

A couple demonstrations with plotly.

Note the layering as with ggplot2.

Piping used before plotting.

library(plotly)
midwest %>% 
  filter(inmetro==T) %>% 
  plot_ly(x=percollege, y=percbelowpoverty, mode='markers') 

plotly example

Plotly has modes, which allow for points, lines, text and combinations.

Traces work similar to geoms.

library(mgcv)

mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  plot_ly(x=wt, y=mpg, color=amFactor, width=800, height=500, mode='markers') %>% 
  add_trace(x=wt, y=prediction, alpha=.5, hover=hovertext, name='gam prediction')

ggplotly

The nice thing about plotly is that we can feed a ggplot to it.

It would have been easier to use geom_smooth, so let’s do so.

gp = mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>% 
  arrange(wt) %>% 
  ggplot(aes(x=wt, y=mpg)) +
  geom_smooth() +
  geom_point(aes(color=amFactor))
ggplotly(gp)

dygraphs

Dygraphs are useful for time-series.

library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>% 
  dygraph(width=800) %>% 
  dyOptions(stackedGraph = TRUE, colors=RColorBrewer::brewer.pal(3, name='Dark2')) %>%
  dyRangeSelector(height = 20)

visNetwork

library(visNetwork)
visNetwork(nodes, edges, height=600, width=800) %>% 
  visNodes(shape='circle', 
           font=list(), 
           scaling=list(min=10, max=50, label=list(enable=T))) %>% 
  visLegend()

data table

library(DT)
movies %>% 
  select(1:6) %>% 
  filter(rating>9) %>% 
  slice(sample(1:nrow(.), 50)) %>% 
  datatable(rownames=F)

Shiny

Shiny is a framework that can essentially allow you to build an interactive website.

  • Provided by RStudio developers

Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.

Interactive and Visual Data Exploration

Interactivity allows for even more dimensions to be brought to a graphic.

Interactive graphics are more fun too!

Just a couple visualization packages can go a very long way.

Summary